First, I wanted to investigate which variables are time-dependent and to exclude some that were clearly unnecessary (namely “SITE”, “COLPROT”, “ORIGPROT”, “FLDSTRENG”, “FSVERSION”, “IMAGEUID”, “Month_bl”, “Month”, “M”, and “update_stamp”).
Next, I merged the time-dependent and time-independent variables into the long_dat data frame and recoded the time points in the VISCODE variable as integers.
long_dat <- dat[, c(ivars[, 1], nivars[, 1])] %>%
  mutate(VISCODE = match(VISCODE, c("bl", "m03", "m06", "m12", "m18", "m24",
                                    "m30", "m36", "m42", "m48", "m54", "m60",
                                    "m66", "m72", "m78", "m84", "m90", "m96",
                                    "m102", "m108", "m114", "m120", "m126",
                                    "m132", "m144", "m156")) - 1) %>%
  relocate(RID, PTID, VISCODE) %>%
  arrange(RID, VISCODE)
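As a small illustration of the recoding step, the sketch below applies the same `match(...) - 1` logic to a few example visit codes. The input vector `viscode` and the invalid code `"m999"` are made up for illustration; only the vector of visit levels is taken from the code above.

```r
# Ordered visit codes, copied from the recoding step above.
visit_levels <- c("bl", "m03", "m06", "m12", "m18", "m24",
                  "m30", "m36", "m42", "m48", "m54", "m60",
                  "m66", "m72", "m78", "m84", "m90", "m96",
                  "m102", "m108", "m114", "m120", "m126",
                  "m132", "m144", "m156")

# Hypothetical example input; "m999" is not a valid visit code.
viscode <- c("bl", "m06", "m42", "m48", "m999")

# match() returns the 1-based position in visit_levels, so subtracting 1
# makes baseline time point 0. Codes that are not found become NA.
timepoint <- match(viscode, visit_levels) - 1L

print(data.frame(viscode, timepoint))
```

Note that unknown visit codes silently become `NA`, so it can be worth counting missing values in VISCODE after this step.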
The original data frame contained quite a few _bl or _BL variables. I therefore checked whether these baseline values had already been integrated at the corresponding time point for each participant. Surprisingly, they had not.
Therefore, I merged the _bl/_BL variables with the corresponding time-dependent variable for each participant. Additionally, I specified the data type of each variable individually to keep full control over the data structure.
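The check and the merge described above could look roughly like the following sketch. The exact merging rule used in the analysis is not shown, so this assumes the simplest variant: at baseline (VISCODE == 0), a missing time-dependent value is filled from its _bl counterpart. The toy data and the choice of MMSE as example variable are hypothetical.

```r
# Hypothetical long-format data: one row per participant (RID) and visit.
long_dat <- data.frame(
  RID     = c(2, 2, 3, 3),
  VISCODE = c(0, 2, 0, 2),
  MMSE    = c(NA, 27, 29, 28),
  MMSE_bl = c(28, 28, 29, 29)
)

# Check: is the baseline value already present in MMSE at VISCODE == 0?
is_bl <- long_dat$VISCODE == 0
already_integrated <- all(!is.na(long_dat$MMSE[is_bl]) &
                            long_dat$MMSE[is_bl] == long_dat$MMSE_bl[is_bl])
print(already_integrated)

# Assumed merging rule: fill missing baseline values from the _bl column,
# then drop the now redundant _bl column.
fill <- is_bl & is.na(long_dat$MMSE)
long_dat$MMSE[fill] <- long_dat$MMSE_bl[fill]
long_dat$MMSE_bl <- NULL

print(long_dat)
```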
Transform Long to Wide Data Format
## # A tibble: 6 × 1,153
## RID PTID AGE PTGENDER PTEDUCAT PTETHCAT PTRACCAT PTMARRY APOE4 FDG_0
## <fct> <chr> <dbl> <fct> <int> <fct> <fct> <fct> <int> <dbl>
## 1 2 011_S_0002 74.3 Male 16 Not His… White Married 0 1.37
## 2 3 011_S_0003 81.3 Male 18 Not His… White Married 1 1.08
## 3 4 022_S_0004 67.5 Male 10 Hisp/La… White Married 0 NA
## 4 5 011_S_0005 73.7 Male 16 Not His… White Married 0 1.29
## 5 6 100_S_0006 80.4 Female 13 Not His… White Married 0 NA
## 6 7 022_S_0007 75.4 Male 10 Hisp/La… More th… Married 1 NA
## # ℹ 1,143 more variables: FDG_2 <dbl>, FDG_7 <dbl>, FDG_11 <dbl>, FDG_12 <dbl>,
## # FDG_13 <dbl>, FDG_14 <dbl>, FDG_15 <dbl>, FDG_16 <dbl>, FDG_17 <dbl>,
## # FDG_18 <dbl>, FDG_19 <dbl>, FDG_21 <dbl>, FDG_22 <dbl>, FDG_23 <dbl>,
## # FDG_24 <dbl>, FDG_3 <dbl>, FDG_4 <dbl>, FDG_5 <dbl>, FDG_6 <dbl>,
## # FDG_9 <dbl>, FDG_8 <dbl>, FDG_10 <dbl>, FDG_25 <dbl>, FDG_20 <dbl>,
## # FDG_1 <dbl>, PIB_0 <dbl>, PIB_2 <dbl>, PIB_7 <dbl>, PIB_11 <dbl>,
## # PIB_12 <dbl>, PIB_13 <dbl>, PIB_14 <dbl>, PIB_15 <dbl>, PIB_16 <dbl>, …
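A wide table with column names such as FDG_0 and PIB_2 can be produced from long data along the following lines. This is only a sketch on toy data; it assumes `tidyr::pivot_wider()` was used, which the output above does not confirm, and the values are made up.

```r
library(tidyr)

# Hypothetical long-format data: one row per participant and time point.
long_dat <- data.frame(
  RID     = c(2, 2, 3, 3),
  AGE     = c(74.3, 74.3, 81.3, 81.3),
  VISCODE = c(0, 2, 0, 2),
  FDG     = c(1.37, 1.30, 1.08, NA),
  PIB     = c(NA, 1.5, 1.6, 1.7)
)

# Time-independent columns go into id_cols; every time-dependent variable
# is spread into one column per time point, named <variable>_<VISCODE>.
wide_dat <- pivot_wider(
  long_dat,
  id_cols     = c(RID, AGE),
  names_from  = VISCODE,
  values_from = c(FDG, PIB),
  names_sep   = "_"
)

print(wide_dat)
```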
To get a first idea of the sampling frequency, I plotted the number of participants measured at each time point.
The plot suggests that time point 9 acts as a cut-off after which the number of measurements drops sharply. With the 0-based recoding of VISCODE (bl = 0), time point 9 corresponds to month 48 (i.e., 4 years) of the follow-up.
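A frequency plot of this kind can be sketched in base R as follows. The toy data are hypothetical, and the sketch assumes that "measured" means having a row in the long data for that time point; the actual analysis may have counted non-missing values per variable instead.

```r
# Hypothetical long data: one row per participant (RID) and attended visit.
long_dat <- data.frame(
  RID     = c(2, 2, 2, 3, 3, 4),
  VISCODE = c(0, 1, 2, 0, 2, 0)
)

# Number of distinct participants observed at each time point.
n_per_visit <- tapply(long_dat$RID, long_dat$VISCODE,
                      function(x) length(unique(x)))
print(n_per_visit)

# Simple frequency plot of the sampling density over time.
barplot(n_per_visit,
        xlab = "Time point (recoded VISCODE)",
        ylab = "Number of participants")
```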
By default, merge() returns only the rows that have a matching key in both data frames (here, PTID matched via by.x and by.y). The genetic data include 2 additional individuals for whom we have no other measurements, so they are dropped. The final data frame, for which both testing and genetic data are available, therefore contains N = 1408 participants.
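The default inner-join behaviour of `merge()` can be illustrated on toy data. The data frame names `pheno` and `geno`, the scores, and the two `999_S_...` identifiers are hypothetical.

```r
# Hypothetical testing data and genetic data, both keyed by PTID.
pheno <- data.frame(
  PTID = c("011_S_0002", "011_S_0003", "022_S_0004"),
  MMSE = c(28, 29, 27)
)
geno <- data.frame(
  PTID = c("011_S_0002", "011_S_0003", "022_S_0004",
           "999_S_0001", "999_S_0002"),
  PGS  = c(0.1, -0.4, 1.2, 0.3, -0.8)
)

# Default merge(): only keys present in BOTH data frames are kept.
merged <- merge(pheno, geno, by.x = "PTID", by.y = "PTID")
print(nrow(merged))

# Individuals with genetic data but no other measurements.
unmatched <- setdiff(geno$PTID, pheno$PTID)
print(unmatched)
```

Passing `all.y = TRUE` (or `all = TRUE`) to `merge()` would keep the unmatched individuals with `NA` in the missing columns.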
The plot shows a positive relationship between the polygenic score (PGS) for educational attainment and actual years of education: individuals with a higher PGS, i.e., a higher genetic capacity for educational attainment, tend to have completed more years of education.
Pearson’s correlation confirmed this relationship: r = 0.286 (p-value < 2.2e-16).
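For reference, a Pearson correlation test of this kind can be run with `cor.test()`. The sketch below uses simulated data, so the variable names and the resulting numbers are illustrative only and will not reproduce the values reported above.

```r
set.seed(1)

# Simulated (hypothetical) polygenic scores and years of education.
n    <- 500
pgs  <- rnorm(n)
educ <- 16 + 0.8 * pgs + rnorm(n, sd = 2.5)

# Pearson correlation with a two-sided test of r = 0.
ct <- cor.test(pgs, educ, method = "pearson")

print(ct$estimate)  # correlation coefficient r
print(ct$p.value)   # p-value of the test
```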
To obtain the residuals, we regressed actual educational attainment (EA, in years of education) on the polygenic score for educational attainment, including SEX and AGE as covariates. The distribution of the residuals is depicted in the density plot.
It is important to interpret the residual scores correctly: a high (positive) residual means that the individual has over-performed relative to his or her genetic capacity, because the residual is the actual value minus the predicted value. The following table illustrates this:
## Actual Predicted Residuals
## 1 18 16.91911 1.0808864
## 2 16 15.16815 0.8318481
## 3 12 16.64336 -4.6433625
## 4 20 16.02560 3.9743989
## 5 14 14.83958 -0.8395765
## 6 13 15.37284 -2.3728412
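A table like the one above can be generated as in the following sketch, which fits a linear model on simulated data and confirms that each residual equals the actual value minus the predicted value. The column names (EDUC, PGS, AGE, SEX) and all values are assumptions made for illustration.

```r
set.seed(42)

# Simulated (hypothetical) data set.
n   <- 300
dat <- data.frame(
  PGS = rnorm(n),
  AGE = runif(n, 55, 90),
  SEX = factor(sample(c("Male", "Female"), n, replace = TRUE))
)
dat$EDUC <- round(16 + 0.8 * dat$PGS + rnorm(n, sd = 2.5))

# Actual years of education regressed on the PGS, with AGE and SEX as covariates.
fit <- lm(EDUC ~ PGS + AGE + SEX, data = dat)

# Residual = actual - predicted; positive values mean more education
# than the model predicts from the PGS and covariates.
res_tab <- data.frame(
  Actual    = dat$EDUC,
  Predicted = unname(fitted(fit)),
  Residuals = unname(resid(fit))
)
print(head(res_tab))

# Density plot of the residuals.
plot(density(res_tab$Residuals), main = "Residual EA")
```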
Interestingly, the residuals are not normally distributed. Does this suggest that we should continue with non-parametric analysis techniques?
Using the ntile() function from dplyr, the residuals are split into tertiles: the lower tertile is assigned the value 1 (~ negative residuals), the middle tertile the value 2, and the upper tertile the value 3 (~ positive residuals).
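As a quick illustration of the tertile assignment, the sketch below applies `dplyr::ntile()` to the six residuals from the table above (rounded to one decimal). The variable name `thirtile` follows the model output further down; the vector `resid_scores` is a hypothetical name.

```r
library(dplyr)

# Rounded residuals from the example table, in the same row order.
resid_scores <- c(1.1, 0.8, -4.6, 4.0, -0.8, -2.4)

# ntile() ranks the values and splits them into 3 equally sized groups:
# 1 = lowest residuals, 3 = highest residuals.
thirtile <- ntile(resid_scores, 3)
print(data.frame(resid_scores, thirtile))

# Keep only the extreme groups (1 and 3) for a two-group comparison.
extremes <- resid_scores[thirtile != 2]
print(extremes)
```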
“The mini–mental state examination (MMSE) is a 30-point questionnaire that is used extensively in clinical and research settings to measure cognitive impairment. It is commonly used in medicine and allied health to screen for dementia. It is also used to estimate the severity and progression of cognitive impairment and to follow the course of cognitive changes in an individual over time; thus making it an effective way to document an individual’s response to treatment. Administration of the test takes between 5 and 10 minutes and examines functions including registration (repeating named prompts), attention and calculation, recall, language, ability to follow simple commands and orientation. […] Any score of 24 or more (out of 30) indicates a normal cognition. Below this, scores can indicate severe (≤9 points), moderate (10–18 points) or mild (19–23 points) cognitive impairment.” (Wikipedia.org)
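The severity bands quoted above translate directly into a `cut()` call. The sketch below also derives a binary indicator for "below normal cognition"; that the MMSE_cut variable in the following model output was defined this way (score < 24) is an assumption, and the example scores are made up.

```r
# Hypothetical MMSE scores covering every band.
mmse <- c(30, 24, 23, 19, 18, 10, 9, 0)

# Bands from the quoted definition: <=9 severe, 10-18 moderate,
# 19-23 mild, 24-30 normal. Intervals are closed on the right.
severity <- cut(mmse,
                breaks = c(-Inf, 9, 18, 23, 30),
                labels = c("severe", "moderate", "mild", "normal"))

# Assumed event definition: 1 if the score is below the cut-off of 24.
mmse_cut <- as.integer(mmse < 24)

print(data.frame(mmse, severity, mmse_cut))
```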
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), MMSE_cut) ~ thirtile,
## data = .)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 2800 507 399 29.5 62.8
## thirtile=3 2745 284 392 30.0 62.8
##
## Chisq= 62.8 on 1 degrees of freedom, p= 2e-15
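For readers who want to reproduce the structure of this test, the following sketch runs the same kind of log-rank comparison with `survival::survdiff()` on simulated data. The simulated hazards, the censoring at visit 10, and the one-row-per-participant layout are assumptions made for illustration; they do not reproduce the data behind the output above.

```r
library(survival)

set.seed(123)

# Simulated (hypothetical) data: 200 participants in each extreme tertile.
n   <- 400
sim <- data.frame(thirtile = rep(c(1, 3), each = n / 2))

# Assumed higher event rate in tertile 1 than in tertile 3.
rate       <- ifelse(sim$thirtile == 1, 0.15, 0.08)
event_time <- rexp(n, rate)

# Visit at which the cut-off was crossed; follow-up ends at visit 10.
sim$VISCODE  <- pmin(ceiling(event_time), 10)
sim$MMSE_cut <- as.integer(event_time <= 10)  # 1 = event, 0 = censored

# Log-rank test comparing time-to-event between the two tertiles.
fit <- survdiff(Surv(as.integer(VISCODE), MMSE_cut) ~ thirtile, data = sim)
print(fit)

# The p-value follows from the chi-square statistic with 1 degree of freedom.
p_value <- pchisq(fit$chisq, df = 1, lower.tail = FALSE)
print(p_value)
```

In the printed table, "Observed" is the number of events per group and "Expected" is the number expected if both groups shared the same hazard.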
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), ADAS11_cut) ~ thirtile,
## data = .)
##
## n=5540, 5 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 2799 217 169 13.4 27.7
## thirtile=3 2741 119 167 13.6 27.7
##
## Chisq= 27.7 on 1 degrees of freedom, p= 1e-07
“The ADAS13 was included as a global measure of cognitive function. ADAS13 is a test battery developed to assess severity of cognitive impairment associated with AD and includes subtests and clinical evaluations assessing memory function, reasoning, language function, orientation and praxis. The ADAS13 is a modified version of the original ADAS-Cog-11, adding a cancellation task and a delayed free recall task. The higher the scores, the more severe impairment of cognitive function.” (Mofrad et al., 2021)
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), ADAS13_cut) ~ thirtile,
## data = .)
##
## n=7290, 27 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 3649 255 206 11.9 24.1
## thirtile=3 3641 157 206 11.8 24.1
##
## Chisq= 24.1 on 1 degrees of freedom, p= 9e-07
## Warning in pchisq(chi, df, lower.tail = FALSE): NaNs produced
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), ADASQ4_cut) ~ thirtile,
## data = .)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 3659 0 0 NaN NaN
## thirtile=3 3658 0 0 NaN NaN
## Warning in pchisq(x$chisq, df, lower.tail = FALSE): NaNs produced
##
## Chisq= 0 on -1 degrees of freedom, p= NA
“The clinical dementia rating (CDR) scale is commonly used to diagnose dementia due to Alzheimer’s disease (AD). The sum of boxes of the CDR (CDR-SB) has recently been emphasized and applied to interventional trials for tracing the progression of cognitive impairment (CI) in the early stages of AD.” (Tzeng et al., 2022)
See Table 3 for explanation on the staging category (O’Bryant et al., 2012)
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), CDRSB_cut) ~ thirtile,
## data = .)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 3659 433 341 24.6 50.8
## thirtile=3 3658 252 344 24.5 50.8
##
## Chisq= 50.8 on 1 degrees of freedom, p= 1e-12
“The DSST (Digit Symbol Substitution Test) is a paper-and-pencil cognitive test presented on a single sheet of paper that requires a subject to match symbols to numbers according to a key located on the top of the page. The subject copies the symbol into spaces below a row of numbers. The number of correct symbols within the allowed time, usually 90 to 120 seconds, constitutes the score.” (Jaeger, 2018) The lower the scores, the more severe impairment of cognitive function.
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), DIGITSCOR_cut) ~
## thirtile, data = .)
##
## n=4029, 3288 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 2146 103 78.2 7.89 16.6
## thirtile=3 1883 47 71.8 8.59 16.6
##
## Chisq= 16.6 on 1 degrees of freedom, p= 5e-05
The Functional Activities Questionnaire is used to assess an individual’s functional abilities in daily living activities. It is a caregiver-based questionnaire that helps evaluate how well a person is able to perform various instrumental activities of daily living (IADLs) and basic activities of daily living (ADLs). (ChatGPT) Sum scores (range 0-30). There is no established cut-off score for IADL impairment on the FAQ. A higher score relates to more severe impairment in functional abilities.
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), FAQ_cut) ~ thirtile,
## data = .)
##
## n=7308, 9 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 3659 425 364 10.3 21.3
## thirtile=3 3649 304 365 10.3 21.3
##
## Chisq= 21.3 on 1 degrees of freedom, p= 4e-06
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), LDELTOTAL_cut) ~
## thirtile, data = .)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 3659 120 223 47.7 96.7
## thirtile=3 3658 329 226 47.2 96.7
##
## Chisq= 96.7 on 1 degrees of freedom, p= <2e-16
Reference literature: doi: 10.1111/j.1532-5415.2005.53221.x
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), MOCA_cut) ~ thirtile,
## data = .)
##
## n=3896, 3421 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 1819 328 418 19.4 39.2
## thirtile=3 2077 545 455 17.8 39.2
##
## Chisq= 39.2 on 1 degrees of freedom, p= 4e-10
The RAVLT was included as a measure of memory function. In this test, the participants are asked to recall words from a list of 15 nouns immediately after each of five learning trials and after a short and a long delay. Two measures known to be sensitive to cognitive changes in patients with AD were included in the present study: Immediate recall (RAVLT-Im): the number of correct responses across the immediate recall of the five learning trials; percent forgetting (RAVLT-PF): the score on the fifth learning trial minus the score on the long delayed recall, divided by the score obtained on the fifth learning trial. The lower the scores, the more severe impairment of cognitive function.
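The two RAVLT measures defined above can be written out as a short sketch. The function names and the toy scores are hypothetical, and multiplying by 100 to express forgetting as a percentage is an assumption; the text only gives the ratio.

```r
# Immediate recall: total number of correct responses over five learning trials.
# 'trials' is a matrix with one row per participant and one column per trial.
ravlt_immediate <- function(trials) {
  rowSums(trials)
}

# Percent forgetting: (trial 5 - long delayed recall) / trial 5, times 100.
# Undefined (NA) when nothing was recalled on trial 5.
ravlt_perc_forgetting <- function(trial5, delayed) {
  ifelse(trial5 > 0, 100 * (trial5 - delayed) / trial5, NA_real_)
}

# Hypothetical scores for three participants.
trials <- rbind(c(5, 7, 9, 11, 12),
                c(4, 6, 8, 9, 10),
                c(0, 0, 0, 0, 0))
delayed <- c(9, 10, 0)

immediate  <- ravlt_immediate(trials)
forgetting <- ravlt_perc_forgetting(trials[, 5], delayed)

print(data.frame(immediate, forgetting))
```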
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), RAVLT_forgetting_cut) ~
## thirtile, data = .)
##
## n=7299, 18 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 3651 68 83.5 2.89 5.77
## thirtile=3 3648 100 84.5 2.85 5.77
##
## Chisq= 5.8 on 1 degrees of freedom, p= 0.02
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), RAVLT_immediate_cut) ~
## thirtile, data = .)
##
## n=7299, 18 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 3651 47 142 63.3 127
## thirtile=3 3648 238 143 62.6 127
##
## Chisq= 127 on 1 degrees of freedom, p= <2e-16
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), RAVLT_learning_cut) ~
## thirtile, data = .)
##
## n=7299, 18 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 3651 95 126 7.47 14.9
## thirtile=3 3648 158 127 7.37 14.9
##
## Chisq= 14.9 on 1 degrees of freedom, p= 1e-04
## Warning in pchisq(chi, df, lower.tail = FALSE): NaNs produced
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), RAVLT_perc_forgetting_cut) ~
## thirtile, data = .)
##
## n=7290, 27 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 3642 0 0 NaN NaN
## thirtile=3 3648 0 0 NaN NaN
## Warning in pchisq(x$chisq, df, lower.tail = FALSE): NaNs produced
##
## Chisq= 0 on -1 degrees of freedom, p= NA
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), TRABSCOR_cut) ~
## thirtile, data = .)
##
## n=7252, 65 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 3619 503 386 35.8 72.5
## thirtile=3 3633 274 391 35.2 72.5
##
## Chisq= 72.5 on 1 degrees of freedom, p= <2e-16
The original version of the ECog is an informant-based measure of cognitively relevant everyday abilities comprising 39 items that cover six domains: Everyday Memory, Everyday Language, Everyday Visuospatial Abilities, Everyday Planning, Everyday Organization, and Everyday Divided Attention. Ratings are made on a four-point scale: 1 = better or no change compared to 10 years earlier, 2 = questionable/occasionally worse, 3 = consistently a little worse, 4 = consistently much worse. (Tomaszewski Farias et al., 2012)
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), EcogPtDivatt_cut) ~
## thirtile, data = .)
##
## n=3888, 3429 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 1820 93 96.3 0.110 0.211
## thirtile=3 2068 110 106.7 0.099 0.211
##
## Chisq= 0.2 on 1 degrees of freedom, p= 0.6
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), EcogPtLang_cut) ~
## thirtile, data = .)
##
## n=3919, 3398 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 1830 133 111 4.56 8.78
## thirtile=3 2089 100 122 4.11 8.78
##
## Chisq= 8.8 on 1 degrees of freedom, p= 0.003
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), EcogPtMem_cut) ~
## thirtile, data = .)
##
## n=3925, 3392 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 1828 59 65.4 0.618 1.18
## thirtile=3 2097 79 72.6 0.556 1.18
##
## Chisq= 1.2 on 1 degrees of freedom, p= 0.3
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), EcogPtOrgan_cut) ~
## thirtile, data = .)
##
## n=3855, 3462 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 1787 129 125 0.119 0.228
## thirtile=3 2068 136 140 0.106 0.228
##
## Chisq= 0.2 on 1 degrees of freedom, p= 0.6
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), EcogPtPlan_cut) ~
## thirtile, data = .)
##
## n=3915, 3402 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 1828 133 104 7.84 15.2
## thirtile=3 2087 86 115 7.14 15.2
##
## Chisq= 15.2 on 1 degrees of freedom, p= 1e-04
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), EcogPtVisspat_cut) ~
## thirtile, data = .)
##
## n=3897, 3420 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 1827 117 112 0.260 0.504
## thirtile=3 2070 116 121 0.239 0.504
##
## Chisq= 0.5 on 1 degrees of freedom, p= 0.5
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), EcogPtTotal_cut) ~
## thirtile, data = .)
##
## n=3919, 3398 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 1830 121 107 1.95 3.76
## thirtile=3 2089 103 117 1.77 3.76
##
## Chisq= 3.8 on 1 degrees of freedom, p= 0.05
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), EcogSPDivatt_cut) ~
## thirtile, data = .)
##
## n=3913, 3404 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 1843 251 215 6.06 12.3
## thirtile=3 2070 190 226 5.76 12.3
##
## Chisq= 12.3 on 1 degrees of freedom, p= 5e-04
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), EcogSPLang_cut) ~
## thirtile, data = .)
##
## n=3989, 3328 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 1872 188 169 2.11 4.29
## thirtile=3 2117 157 176 2.03 4.29
##
## Chisq= 4.3 on 1 degrees of freedom, p= 0.04
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), EcogSPMem_cut) ~
## thirtile, data = .)
##
## n=3989, 3328 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 1872 179 150 5.74 11.7
## thirtile=3 2117 125 154 5.57 11.7
##
## Chisq= 11.7 on 1 degrees of freedom, p= 6e-04
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), EcogSPOrgan_cut) ~
## thirtile, data = .)
##
## n=3850, 3467 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 1791 231 217 0.950 1.93
## thirtile=3 2059 216 230 0.893 1.93
##
## Chisq= 1.9 on 1 degrees of freedom, p= 0.2
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), EcogSPPlan_cut) ~
## thirtile, data = .)
##
## n=3959, 3358 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 1855 267 226 7.44 15.1
## thirtile=3 2104 197 238 7.06 15.1
##
## Chisq= 15.1 on 1 degrees of freedom, p= 1e-04
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), EcogSPVisspat_cut) ~
## thirtile, data = .)
##
## n=3953, 3364 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 1846 233 205 3.85 7.8
## thirtile=3 2107 191 219 3.60 7.8
##
## Chisq= 7.8 on 1 degrees of freedom, p= 0.005
## Call:
## survdiff(formula = Surv(as.integer(VISCODE), EcogSPTotal_cut) ~
## thirtile, data = .)
##
## n=3981, 3336 observations deleted due to missingness.
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## thirtile=1 1871 246 215 4.61 9.39
## thirtile=3 2110 192 223 4.42 9.39
##
## Chisq= 9.4 on 1 degrees of freedom, p= 0.002